Tagset Reduction Without Information Loss 
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Abstract 

A technique for reducing a tagset used 
for n-gram part-of-speech disambiguation 
is introduced and evaluated in an experi- 
ment. The technique ensures that all in- 
formation that is provided by the original 
tagset can be restored from the reduced 
one. This is crucial, since we are inter- 
ested in the linguistically motivated tags 
for part-of-speech disambiguation. The re- 
duced tagset needs fewer parameters for its 
statistical model and allows more accurate 
parameter estimation. Additionally, there 
is a slight but not significant improvement 
of tagging accuracy. 

1 Motivation 

Statistical part-of-speech disambiguation can be ef- 
ficiently done with n-gram models ( Church, 1988 ; 
Cutting et al., 1992| ) . These models are equivalent 
to Hidden Markov Models (HMMs) ( [Rabincr, 19891 ) 
of order n — 1. The states represent parts of speech 
(categories, tags), there is exactly one state for each 
category, and each state outputs words of a particu- 
lar category. The transition and output probabilities 
of the HMM are derived from smoothed frequency 
counts in a text corpus. 

Generally, the categories for part-of-speech tag- 
ging are linguistically motivated and do not reflect 
the probability distributions or co-occurrence prob- 
abilities of words belonging to that category. It is 
an implicit assumption for statistical part-of-speech 
tagging that words belonging to the same category 
have similar probability distributions. But this as- 
sumption does not hold in many of the cases. 

Take for example the word cliff which could be 
a proper (NP)[] or a common noun (NN) (ignoring 

1 A11 tag names used in this paper are inspired by 
those used for the LOB Corpus (Garside et al., 1987). 



capitalization of proper nouns for the moment). The 
two previous words are a determiner (AT) and an 
adjective (JJ). The probability of cliff being a com- 
mon noun is the product of the respective contextual 
and lexical probabilities p(NN|AT, JJ) • p(cliff\NN), 
regardless of other information provided by the ac- 
tual words (a sheer cliff vs. the wise Cliff). Obvi- 
ously, information useful for probability estimation 
is not encoded in the tagset. 

On the other hand, in some cases information not 
needed for probability estimation is encoded in the 
tagset. The distributions for comparative and su- 
perlative forms of adjectives in the Susanne Corpus 
( (Bampson, 1995 ) are very similar. The number of 
correct tag assignments is not affected when we com- 
bine the two categories. However, it docs not suffice 
to assign the combined tag, if we are interested in 
the distinction between comparative and superlative 
form for further processing. We have to ensure that 
the original (interesting) tag can be restored. 

There are two contradicting requirements. On the 
one hand, more tags mean that there is more infor- 
mation about a word at hand, on the other hand, 
the more tags, the severer the sparse-data problem 
is and the larger the corpora that are needed for 
training. 

This paper presents a way to modify a given 
tagset, such that categories with similar distribu- 
tions in a corpus are combined without losing infor- 
mation provided by the original tagset and without 
losing accuracy. 

2 Clustering of Tags 

The aim of the presented method is to reduce a 
tagset as much as possible by combining (cluster- 
ing) two or more tags without losing information and 
without losing accuracy. The fewer tags we have, the 
less parameters have to be estimated and stored, and 
the less severe is the sparse data problem. Incoming 
text will be disambiguated with the new reduced 



tagset, but we ensure that the original tag is still 
uniquely identified by the new tag. 

The basic idea is to exploit the fact that some of 
the categories have a very similar frequency distri- 
bution in a corpus. If we combine categories with 
similar distribution characteristics, there should be 
only a small change in the tagging result. The main 
change is that single tags are replaced by a cluster 
of tags, from which the original has to be identified. 
First experiments with tag clustering showed that, 
even for fully automatic identification of the origi- 
nal tag, tagging accuracy slightly increased when the 
reduced tagset was used. This might be a result of 
having more occurrences per tag for a smaller tagset, 
and probability estimates are preciser. 

2.1 Unique Identification of Original Tags 

A crucial property of the reduced tagset is that the 
original tag information can be restored from the 
new tag, since this is the information we are inter- 
ested in. The property can be ensured if we place a 
constraint on the clustering of tags. 

Let W be the set of words, C the set of clus- 
ters (i.e. the reduced tagset), and T the original 
tagset. To restore the original tag from a combined 
tag {cluster), we need a unique function 
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To ensure that there is such a unique function, 
we prohibit some of the possible combinations. A 
cluster is allowed if and only if there is no word in the 
lexicon which can have two or more of the original 
tags combined in one cluster. Formally, seeing tags 
as sets of words and clusters as sets of tags: 

Vc S C, h, t 2 G c, ti ^ t 2 , w e W : w G ii =>• w i 2 

(2) 

If this condition holds, then for all words w tagged 
with a cluster c, exactly one tag t wc fulfills 

w e t wc A t v 



yielding 
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So, the original tag can be restored any time and no 
information from the original tagset is lost. 

Example: Assume that no word in the lexicon can 
be both comparative (JJR) and superlative adjective 
(JJT). The categories are combined to {JJRJJT}. 
When processing a text, the word easier is tagged 
as {JJRJJT}. Since the lexicon states that easier 
can be of category JJR but not of category JJT, the 
original tag must be JJR. 



2.2 Criteria For Combining Tags 

The are several criteria that can determine the qual- 
ity of a particular clustering. 

1. Compare the trigram probabilities p(B\Xi, A), 
p(B\A,Xi), and p(Xi \A,B), i = 1,2. Combine 
two tags X\ and X2 , if these probabilities coin- 
cide to a certain extent. 

2. Maximize the probability that the training cor- 
pus is generated by the HMM which is described 
by the trigram probabilities. 

3. Maximize the tagging accuracy for a training 
corpus. 

Criterion (1) establishes the theoretical basis, 
while criteria (2) and (3) immediately show the ben- 
efit of a particular combination. A measure of sim- 
ilarity for (1) is currently under investigation. We 
chose (3) for our first experiments, since it was the 
easiest one to implement. The only additional effort 
is a separate, previously unused part of the train- 
ing corpus for this purpose, the clustering part. We 
combine those tags into clusters which give the best 
results for tagging of the clustering part. 

2.3 The Algorithm 

The total number of potential clusterings grows ex- 
ponential with the size of the tagset. Since we are 
interested in the reduction of large tagsets, a full 
search regarding all potential clusterings is not fea- 
sible. We compute the local maximum which can be 
found in polynomial time with a best-first search. 

We u se a slight modification of th e algorithm 
used by (Stolcke and Omohundro, 1994) for merging 
HMMs. Our task is very similar to theirs. Stolcke 
and Omohundro start with a first order HMM where 
every state represents a single occurrence of a word 
in a corpus, and the goal is to maximize the a pos- 
teriori probability of the model. We start with a 
second order HMM (since we use trigrams) where 
each state represents a part of speech, and our goal 
is to maximize the tagging accuracy for a corpus. 

The clustering algorithm works as follows: 

1. Compute tagging accuracy for the clustering 
part with the original tagset. 

2. Loop: 

(a) Compute a set of candidate clusters (obey- 
ing constraint (||) mentioned in section 
.1), each consisting of two tags from the 



previous step. 



(b) For each candidate cluster build the result- 
ing tagset and compute tagging accuracy 
for that tagset. 

(c) If tagging accuracy decreases for all combi- 
nations of tags, break from the loop. 

(d) Add the cluster which maximized the tag- 
ging accuracy to the tagset and remove the 
two tags previously used. 

3. Output the resulting tagset. 

2.4 Application of Tag Clustering 

Two standard trigram tagging procedures were per- 
formed as the baseline. Then clustering was per- 
formed on the same data and tagging was done with 
the reduced tagset. The reduced tagset was only in- 
ternally used, the output of the tagger consisted of 
the original tagset for all experiments. 

The Susanne Corpus has about 157,000 words and 
uses 424 tags (counting tags with indices denoting 
multi-word lexemes as sep arate tags). The tag s are 
based on the LOB tagset ( Garsidc et al., 1987 ). 

Three parts are taken from the corpus. Part A 
consists of about 127,000 words, part B of about 
10,000 words, and part C of about 10,000 words. 
The rest of the corpus, about 10,000 words, is not 
used for this experiment. All parts are mutually 
disjunct. 

First, part A and B were used for training, and 
part C for testing. Then, part A and C were used 
for training, and part B for testing. About 6% of the 
words in the test parts did not occur in the training 
parts, i.e. they are unknown. For the moment we 
only care about the known words and not about the 
unknown words (this is treated as a separate prob- 
lem) . Table [l] shows the tagging results for known 
words. 

Clustering was applied in the next steps. In the 
third experiment, part A was used for trigram train- 
ing, part B for clustering and part C for testing. In 
the fourth experiment, part A was used for trigram 
training, part C for clustering and part B for testing. 

The baseline experiments used the clustering part 
for the normal training procedure to ensure that bet- 
ter performance in the clustering experiments is not 
due to information provided by the additional part. 

Clustering reduced the tagset by 33 (third exp.), 
and 31 (fourth exp.) tags. The tagging results for 
the known words are shown in table El 

The improvement in the tagging result is too small 
to be significant. However, the tagset is reduced, 
thus also reducing the number of parameters without 
losing accuracy. Experiments with larger texts and 
more permutations will be performed to get precise 
results for the improvement. 



3 Conclusions 

We have shown a method for reducing a tagset used 
for part-of-speech tagging without losing informa- 
tion given by the original tagset. In a first exper- 
iment, we were able to reduce a large tagset and 
needed fewer parameters for the /(-grain model. Ad- 
ditionally, tagging accuracy slightly increased, but 
the improvement was not significant. Further inves- 
tigation will focus on criteria for cluster selection. 
Can we use a similarity measure of probability dis- 
tributions to identify optimal clusters? How far can 
we reduce the tagset without losing accuracy? 
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Table 1: Tagging results for the test parts in the clustering experiments. Exp. 1 and 2 are used as the 
baseline. 





Training 


Clustering 


Testing 


Result (known words) 


1. 


parts A and B 




part C 


93.7% correct 


2. 


parts A and C 




part B 


94.6% correct 


3. 


part A 


part B 


part C 


93.9% correct 


4. 


part A 


part C 


part B 


94.7% correct 



